Analyzing NYC High School Data
Posted on February 12, 2017 in posts
Here I will explore the relationship between SAT scores and demographic factors in New York City public schools. For a brief bit of background: the SAT, or Scholastic Aptitude Test, is a standardized test taken by US high school students each year. It has three sections, each worth a maximum of 800 points, and colleges use the scores to decide which students to admit. A high average SAT score is usually taken as a sign of a good school.
New York City has published data on its students' SAT scores, along with additional demographic datasets. In the last three missions, we combined the following datasets into a single, clean pandas DataFrame:
- SAT scores by school -- SAT scores for each high school in New York City.
- School attendance -- attendance information for every school in NYC.
- Class size -- class size information for each school in NYC.
- AP test results -- Advanced Placement exam results for each high school. Passing AP exams can earn students college credit in the US.
- Graduation outcomes -- the percentage of students who graduated, and other outcome information.
- Demographics -- demographic information for each school.
- School survey -- surveys of parents, teachers, and students at each school.

New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to examine whether the SAT is a fair test. If certain demographic groups consistently scored better on the SAT, for example, we would have some evidence that the test is unfair.
The DataFrame combined contains all of our data, and is what we'll be using in our analysis.
Read in the data
import matplotlib
matplotlib.use('qt4agg')
import pandas as pd
import numpy as np
import re
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
Read in the surveys
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_10",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
Add DBN columns
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
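A DBN (district borough number) is the two-digit, zero-padded district number followed by the school code, which is why single-digit districts need padding. A quick, self-contained check of that construction (the school codes here are just examples):

```python
def pad_csd(num):
    # Left-pad single-digit district numbers with a zero ("1" -> "01").
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    return "0" + string_representation

# Example district numbers combined with example school codes
print(pad_csd(1) + "M292")   # 01M292
print(pad_csd(13) + "K430")  # 13K430
```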
Convert columns to numeric
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
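Using errors="coerce" turns entries that can't be parsed as numbers into NaN, and NaN propagates through the sum, so any school missing a section score ends up with a missing total. A pure-Python sketch of that behavior (the coerce helper below is illustrative, not a pandas function):

```python
def coerce(value):
    # Mimic pd.to_numeric(..., errors="coerce") for a single value:
    # return a float, or None when the value isn't numeric.
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

section_scores = ["404", "383", "423"]        # all three sections present
coerced = [coerce(v) for v in section_scores]
total = sum(coerced) if None not in coerced else None
print(total)  # 1210.0

# A non-numeric placeholder knocks out the whole total, just like NaN would.
print(coerce("s"))  # None
```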
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
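Each Location 1 value ends with a "(latitude, longitude)" pair, which the regex extracts before the string is split on the comma. A self-contained run of the same parsing logic on a made-up address string:

```python
import re

def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    return coords[0].split(",")[0].replace("(", "")

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    return coords[0].split(",")[1].replace(")", "").strip()

# Hypothetical "Location 1" value in the hs_directory format
loc = "883 Classon Avenue\nBrooklyn, NY 11225\n(40.67, -73.96)"
print(find_lat(loc), find_lon(loc))  # 40.67 -73.96
```

Note that "." does not match newlines, so the address lines above the coordinate pair never match the pattern.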
Condense datasets
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
Convert AP scores to numeric
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Combine the datasets
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
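The merges use left joins to keep every school in sat_results even when ap_2010 or graduation lacks a matching DBN, then inner joins for the remaining datasets. A minimal sketch of the difference, using made-up DBNs and values:

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122.0, 1172.0]})
ap = pd.DataFrame({"DBN": ["01M292"], "ap_takers": [39.0]})

left = sat.merge(ap, on="DBN", how="left")    # keeps both schools, NaN for the miss
inner = sat.merge(ap, on="DBN", how="inner")  # keeps only the matching school

print(len(left), len(inner))  # 2 1

# Filling with the column mean, then 0, mirrors the cleanup above.
left = left.fillna(left.mean(numeric_only=True)).fillna(0)
```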
Add a school district column for mapping
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Find correlations
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
pal = sns.cubehelix_palette(25, start=2, rot=0, dark=0, light=.95, reverse=True)
by_corr = sns.barplot(x=survey_fields, y=correlations[survey_fields], palette=pal)
for item in by_corr.get_xticklabels():
    item.set_rotation(90)
sns.despine(left=True, bottom=True)
sns.set_style("dark")
sns.regplot(x='saf_s_11', y='sat_score', data=combined)
sns.despine(left=True, bottom=True)
Findings
Student perception of safety (saf_s_11) has a weak positive correlation with SAT scores across most high schools (r ≈ 0.34).
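The correlations come from combined.corr(), which computes Pearson's r for every pair of numeric columns. A from-scratch sketch of that formula on made-up safety and SAT values:

```python
import math

def pearson_r(xs, ys):
    # r = cov(x, y) / (std(x) * std(y)), computed from raw sums
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy safety scores vs. SAT totals (made up to show the calculation)
safety = [6.0, 6.5, 7.0, 7.5, 8.0]
sat = [1100.0, 1150.0, 1120.0, 1300.0, 1280.0]
print(round(pearson_r(safety, sat), 2))  # 0.86
```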
import folium
from folium import plugins
schools_map = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], tiles='Stamen Toner', zoom_start=10)
marker_cluster = folium.MarkerCluster().add_to(schools_map)
for name, row in combined.iterrows():
    folium.Marker([row["lat"], row["lon"]], popup="{0}:{1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
#schools_map.save('schools.html')
schools_map
schools_heatmap = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()],tiles='Mapbox Bright', zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in combined.iterrows()]))
#schools_heatmap.save("heatmap.html")
schools_heatmap
districts_d = combined.groupby("school_dist").agg(np.mean)
districts_d.reset_index(inplace=True)
districts_d["school_dist"] = districts_d["school_dist"].apply(lambda x: str(int(x)))
def show_district_map(col):
    geo_path = 'schools/districts.geojson'
    districts = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
    # Precompute evenly spaced bins in case a custom threshold_scale is wanted.
    threshold = np.linspace(combined[col].min(), combined[col].max(), 6, dtype=int).tolist()
    districts.choropleth(
        geo_path=geo_path,
        data=districts_d,
        columns=['school_dist', col],
        key_on='feature.properties.school_dist',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=col,
        #threshold_scale=threshold,
    )
    #districts.save("districts.html")
    return districts
show_district_map("saf_s_11")
Lower Brooklyn and Astoria have the highest safety scores, while parts of the Bronx and the area where Brooklyn meets Queens score lower on student-perceived safety.
show_district_map("sat_score")
Comparing the two district-level maps, it seems that schools in safer neighbourhoods tend to have higher SAT scores.
import matplotlib.pyplot as plt
import seaborn as sns
races = ['white_per','asian_per','black_per','hispanic_per']
sns.set_style('whitegrid')
pal = sns.cubehelix_palette(4, start=.5, rot=-.75)
sns.barplot(x=races, y=correlations[races], palette = pal)
sns.despine(left=True)
sns.jointplot(x=combined['white_per'], y=combined['sat_score'])
sns.jointplot(x=combined['asian_per'], y=combined['sat_score'])
sns.jointplot(x=combined['black_per'], y=combined['sat_score'])
sns.jointplot(x=combined['hispanic_per'], y=combined['sat_score'])
show_district_map("asian_per")
show_district_map("hispanic_per")
print(combined[combined['hispanic_per'] > 95]['SCHOOL NAME'])
print(combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME'])
Many of the schools above appear to be specialized science and technology schools that only admit students who pass a very competitive entrance exam. That helps explain why their students do better on the SAT -- they are students from all over New York City who already did well on a standardized test.
correlations[['male_per', 'female_per']].plot.bar()
In the plot above, we can see that a higher percentage of females at a school correlates positively with SAT scores, while a higher percentage of males correlates negatively. Neither correlation is especially strong.
combined.plot.scatter(x='male_per', y='sat_score')
combined.plot.scatter(x='female_per', y='sat_score')
Based on the scatterplot, there doesn't seem to be a strong overall relationship between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (between 60% and 70%) that have high SAT scores.
print(combined[(combined['female_per']> 60)&(combined['sat_score']>1700)]['SCHOOL NAME'])
These schools appear to be very selective liberal arts schools that maintain high academic standards.
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter(x='ap_per', y='sat_score')
There appears to be a weak positive relationship between the percentage of students at a school who take an AP exam and the school's average SAT score, though the correlation is not very strong.